NSF PAR Search | NSF Public Access Repository

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

Cho, JH; Madotto, A; Mavroudi, E; Afouras, T; Nagarajan, T; Maaz, M; Song, Y; Ma, T; Hu, S; Jain, S; et al. (July 2025, https://doi.org/10.48550/arXiv.2504.13180)

Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design, and training recipes. The research community has responded by using distillation from black-box models to label training data. This achieves strong benchmark results, but without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks, focusing on the ability to reason about the "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code, and models.
Free, publicly-accessible full text available July 23, 2026
ZipIt! Merging Models from Different Tasks without Training

Stoica, G; Bolya, D; Bjorner, J; Ramesh, P; Hearn, T; Hoffman, J (January 2024, International Conference on Learning Representations)

Typical deep visual recognition models are capable of performing the one task they were trained on. In this paper, we tackle the extremely difficult problem of combining distinct models with different initializations, each solving a separate task, into one multi-task model without any additional training. Prior work in model merging permutes one model to the space of the other then averages them together. While this works for models trained on the same task, we find that this fails to account for the differences in models trained on disjoint tasks. Thus, we introduce "ZipIt!", a general method for merging two arbitrary models of the same architecture that incorporates two simple strategies. First, in order to account for features that aren't shared between models, we expand the model merging problem to allow for merging features within each model by defining a general "zip" operation. Second, we add support for partially zipping the models up until a specified layer, naturally creating a multi-head model. We find that these two changes combined account for 20-60% improvement over prior work, making it more feasible to merge models trained on disjoint tasks without retraining.
YOLACT: Real-time Instance Segmentation

https://doi.org/10.1109/ICCV.2019.00925

Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y. J. (October 2019, Proceedings of the IEEE International Conference on Computer Vision (ICCV))

Full Text Available
